Abstract

We conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio. We isolate factors by adopting a common convolutional architecture either deployed globally on the image to compute class posterior distributions, or restricted locally to compute class conditional distributions given location, scale and aspect ratios of bounding boxes determined by proposal heuristics. In theory, averaging the latter should yield inferior performance compared to proper marginalization. Yet empirical evidence suggests the converse, leading us to conclude that, at the current level of complexity of convolutional architectures and scale of the data sets used to train them, CNNs are not very effective at marginalizing nuisance variability. We also quantify the effects of context on the overall classification task and its impact on the performance of CNNs, and propose improved sampling techniques for heuristic proposal schemes that improve end-to-end performance to state-of-the-art levels. We test our hypothesis on a classification task using the ImageNet Challenge benchmark and on a wide-baseline matching task using the Oxford and Fischer’s datasets.


1. Introduction 

Convolutional neural networks (CNNs) are the de facto paragon for detecting the presence of objects in a scene, as portrayed by an image. CNNs are described as being “approximately invariant” to nuisance transformations such as planar translation, both by virtue of their architecture (the same operation is repeated at every location akin to a “sliding window” and is followed by local pooling) and by virtue of their approximation properties that, given sufficient parameters and transformed training data, could in principle yield discriminants that are insensitive to nuisance transformations of the data represented in the training set. In addition to planar translation, an object detector must manage variability due to scaling (possibly anisotropic along the coordinate axes, yielding different aspect ratios) and (partial) occlusion. Some nuisances are elements of a transformation group, e.g., the (anisotropic) location-scale group for the case of position, scale and aspect ratio of the object’s support.¹ The fact that convolutional architectures appear effective in classifying images as containing a given object regardless of its position, scale, and aspect ratio [28, 40] suggests that the network can effectively manage such nuisance variability.

However, the quest for top performance in benchmark datasets has led researchers away from letting the CNN manage all nuisance variability. Instead, the image is first pre-processed to yield proposals, which are subsets of the image domain (bounding boxes) to be tested for the presence of a given class (Regions-with-CNN [19]). Proposal mechanisms aim to remove nuisance variability due to position, scale and aspect ratio, leaving a “Category CNN” to classify the resulting bounding box as one of a number of classes it is trained with. Put differently, rather than computing the posterior distribution² with nuisance transformations automatically marginalized, the CNN is used to compute the conditional distribution of classes given the data and a sample element that approximates the nuisance transformation, represented by a bounding box. If the goal is the nuisance itself (object support, as in detection [10]) it can be found via maximum-likelihood (max-out) by selecting the bounding box that yields the highest probability of any class [19, 22]. If the goal is the class regardless of the transformation (as in categorization [10]), the nuisance can be approximately marginalized out by averaging the conditional distributions with respect to an estimate of the nuisance transformations².

¹ The region of the image the object projects onto, often approximated by a bounding box.

² One can think of the conditional distribution of a class $c$ given an image $x$, $p(c \mid x)$, as defined by a CNN, as the class posterior $\int_G p(c \mid x, g)\, dP(g \mid x)$ marginalized with respect to the nuisance group $G$. If the nuisances are known, one can use the class-conditionals $p(c \mid x, g_r)$ at each nuisance $g_r \in G$ in order to approximate $p(c \mid x)$ with a weighted average of conditionals, i.e., $p(c \mid x) \approx \sum_r p(c \mid x, g_r)\, p(g_r \mid x)$.

When a CNN is tested on a proposal $r \subseteq x$ determined by a reference frame $g_r$, it computes $p(c \mid x|_r)$ ($x$ restricted to $r$), which is an approximation of $p(c \mid x, g_r)$. Then, explicit marginalization (assuming uniform weights) computes $\sum_r p(c \mid x|_r)$, which is different from $\sum_r p(c \mid x, g_r)$, which in turn is different from $\sum_r p(c \mid x, g_r)\, p(g_r \mid x)$. This approach is therefore, on average, a lower bound on proper marginalization, and the fact that it would outperform the direct computation of $p(c \mid x)$ is worth investigating empirically.
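As a concrete illustration of the two inference modes just described, the following minimal Python sketch contrasts max-out (used when the bounding box itself is the goal) with averaging of per-box class posteriors (used when only the class matters). The function names `max_out` and `average_posteriors` and the toy numbers are illustrative assumptions, not part of the evaluated system.

```python
def max_out(posteriors):
    """Return (box index, class index) of the single highest class probability."""
    best = max(
        ((r, c, p) for r, dist in enumerate(posteriors) for c, p in enumerate(dist)),
        key=lambda t: t[2],
    )
    return best[0], best[1]


def average_posteriors(posteriors, weights=None):
    """Approximate marginalization: weighted average of per-box class posteriors."""
    if weights is None:
        weights = [1.0] * len(posteriors)  # uniform weights
    total = sum(weights)
    num_classes = len(posteriors[0])
    return [
        sum(w * dist[c] for w, dist in zip(weights, posteriors)) / total
        for c in range(num_classes)
    ]


# Toy class posteriors for three bounding boxes over three classes (made-up numbers).
posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
]
box, cls = max_out(posteriors)              # detection-style answer
marginal = average_posteriors(posteriors)   # categorization-style answer
```

With uniform weights the average remains a valid distribution; passing non-uniform `weights` corresponds to the weighted average of conditionals in the footnote.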

Now, if a CNN were an effective way of computing the marginals with respect to nuisance variability, there would be no benefit in conditioning and averaging with respect to (inferred) nuisance samples. This is a direct corollary of the Data Processing Inequality (DPI, Theorem 2.8.1 in [9]). Proposals are subsets of the whole image, so in theory less informative even after accounting for resolution/sampling artifacts (Fig. 1). A fortiori, performance should further decrease if the conditioning mechanism is not very representative of the nuisance distribution, as is the case for most proposal schemes that produce bounding boxes based on adaptively downsampling a coarse discretization of the location-scale group [24]. Class posteriors conditioned on such bounding boxes discard the image outside them, further limiting the ability of the network to leverage side information, or “context”. Should the converse be true, i.e., should averaging conditional distributions restricted to proposal regions outperform a CNN operating on the entire image, that would bring into question the ability of a CNN to marginalize nuisances such as translation and scaling, or else go against the DPI. In this paper we test this hypothesis, aiming to answer the question: How effective are current CNNs at reducing the effects of nuisance transformations of the input data, such as location and scaling?

To the best of our knowledge, this has never been done in the literature, despite the keen interest in understanding the properties of CNNs [20, 21, 34, 39, 43, 46, 47] following their empirical success. We are cognizant of the dangers of drawing sure conclusions from empirical evaluations, especially when they involve a myriad of parameters and exploit training sets that can exhibit biases. To this end, in Sect. 2 we describe a testing protocol that uses recognized existing modules, and keep all factors constant while testing each hypothesis.

1.1. Contributions 

We first show that a baseline (AlexNet [28]) with single-model top-5 error of 19.96% on ImageNet 2014 Classification slightly decreases in performance (to 20.41%) when constrained to the ground-truth bounding boxes (Table 1). This may seem surprising at first, as it would appear to violate Theorem 2.6.5 of [9] (on average, conditioning on the true value of the nuisance transformation must reduce uncertainty in the classifier). However, note that the restriction to bounding boxes does not just condition on the location-scale group, but also on visibility, as the image outside the bounding box is ignored. Thus, the slight decrease in performance measures the loss from discarding context by ignoring the image beyond the bounding box. When we pad the true bounding boxes with a 10-pixel rim, we show that conditioning on such “ground-truth-with-context” does indeed decrease the error as expected, to 17.65%. In Fig. 1 we show the classification performance as a function of the rim size all the way to the whole image for AlexNet and VGG16 [40]. A 25% rim yields the lowest top-5 errors on the ImageNet validation set for both models. This also indicates that the context effectively leveraged by current CNN architectures is limited to a relatively small neighborhood of the object of interest.

The second contribution concerns the proper sampling of the nuisance group. If we interpret the CNN restricted to a bounding box as a function that maps samples of the location-scale group to class-conditional distributions, where the proposal mechanism down-samples the group, then classical sampling theory [38] teaches that we should retain not the value of the function at the samples, but its local average, a process known as anti-aliasing. Also in Table 1, we show that simple uniform averaging of 4 and 8 samples of the isotropic scale group (leaving location and aspect ratio constant) reduces the error to 15.96% and 14.43% respectively. This is again unintuitive, as one expects that averaging conditional densities would produce less discriminative classifiers, but in line with recent developments concerning “domain-size pooling” [12].

To test the effect of such anti-aliasing on a CNN absent the knowledge of ground truth object location, we follow the methodology and evaluation protocol of [16] to develop a domain-size pooled CNN and test it in their benchmark classification of wide-baseline correspondence of regions selected by a generic low-level detector (MSER [32]). Our third contribution is to show that this procedure improves the baseline CNN by 5-15% mean AP on standard benchmark datasets (Table 3 and Fig. 5 in Sect. 2.2).

Our fourth contribution goes towards answering the question set forth in the preamble: We consider two popular baselines (AlexNet and VGG16) that perform at the state-of-the-art in the ImageNet Classification challenge and introduce novel sampling and pruning methods, as well as an adaptively weighted marginalization based on the inverse Rényi entropy. Now, if averaging the conditional class posteriors obtained with various sampling schemes should improve overall performance, that would imply that the implicit “marginalization” performed by the CNN is inferior to that obtained by sampling the group, and averaging the resulting class conditionals.² This is indeed our observation, e.g., for VGG16, as we achieve an overall performance of 8.02%, compared to 13.24% when using the whole image (Table 2). There are, however, caveats to this answer, which we discuss in Sect. 3.

Method                                            AlexNet                    VGG16
----------------------------------------------------------------------------------------------
Whole image                                       19.96                      13.24
Ground-Truth Bounding Box (GT)                    20.41                      12.44

                                                  AlexNet                    VGG16
Method                                            Isotropic   Anisotropic    Isotropic   Anisotropic
----------------------------------------------------------------------------------------------
GT padded with 10 px                              17.66       17.65          10.91       10.30
Ave-GT, 4 domain sizes (padded with [0,30] px)    15.96       16.00          9.65        8.90
Ave-GT, 8 domain sizes (padded with [0,70] px)    14.43       14.22          8.66        7.84

Table 1. AlexNet’s and VGG16’s top-5 error (%) on the ImageNet 2014 classification challenge when the ground-truth localization is provided, compared to applying the model on the entire image. We pad the ground truth with various rim sizes both isotropically and anisotropically. Then we show how averaging the class posteriors performs when applying the network on concentric domain sizes around the ground truth.

Our fifth contribution is to provide a method that performs at the state of the art in the ImageNet Classification challenge when using a single model. In Table 2 we provide various results and time complexity. We achieve a top-5 classification error of 15.82% and 8.02% for AlexNet and VGG16, compared to 17.55% and 8.85% error when they are tested with 150 regularly sampled crops [40], which corresponds to 9.9% and 9.4% relative error reduction, respectively. Data augmentation techniques such as scale jittering and an ensemble of several models [23, 40, 42] could be deployed along with our method.
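The relative error reductions quoted above follow directly from the reported top-5 errors. The short Python check below reproduces the arithmetic; the helper name `relative_error_reduction` is an illustrative choice.

```python
def relative_error_reduction(baseline_error, new_error):
    """Relative reduction, in percent, of new_error with respect to baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Top-5 errors (%) taken from the text: 150 regular crops vs. the proposed sampling.
alexnet = relative_error_reduction(17.55, 15.82)
vgg16 = relative_error_reduction(8.85, 8.02)
print(f"AlexNet: {alexnet:.1f}%  VGG16: {vgg16:.1f}%")  # AlexNet: 9.9%  VGG16: 9.4%
```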

The source code implementing our method and the scripts necessary to reproduce the evaluation are available at http://vision.ucla.edu/~nick/proj/cnn_nuisances/.

1.2. Related work 

The literature on CNNs and their role in Computer Vision is rapidly evolving. Attempts to understand the inner workings of CNNs are being conducted [6, 20, 21, 29, 34, 39, 43, 46, 47], along with theoretical analysis [2, 4, 8, 41] aimed at characterizing their representational properties. Such intense interest was sparked by the surprising performance of CNNs [6, 11, 19, 23, 28, 36, 37, 40, 42] in Computer Vision benchmarks [10, 15], where many couple a proposal scheme [1, 5, 7, 14, 24, 25, 27, 31, 35, 44, 48] with a CNN. As our work relates to a vast body of work, we refer the reader to references in the papers that describe the benchmarks we adopt, namely [6], [28] and [40].

Bilen et al. [3] also explore the idea of introducing proposals in classification. However, their approach leverages a significantly larger number of candidates and focuses on sophisticated classifiers and post-normalization of class posteriors. Our investigation targets selecting a very small subset of the most discriminative candidates among generic object proposals, while building on popular CNN models.


2. Experiments 

2.1. Large-scale Image Classification 

What if we trivialize location and scaling? First, we test the hypothesis that eliminating the nuisances of location and scaling by providing a bounding box for the object of interest will improve the classification accuracy. This is not a given, for restricting the network to operate on a bounding box prevents it from leveraging context outside it. We use the AlexNet and VGG16 pretrained models, which are provided with the MatConvNet open source library [45], and test their top-1 and top-5 classification errors on the ImageNet 2014 classification challenge [10]. The validation set consists of 50,000 images, in each of which one “salient” class is annotated a priori by a human. However, other ImageNet classes appear in many of the images, which can confound any classifier.

We test the classifier in various settings (Table 1); first, by feeding the entire image to it and letting the classifier manage the nuisances. Then we test the ground-truth annotated bounding box and concentric regions that include it. We try both isotropic and anisotropic expansion of the ground-truth region. We observe similar behavior, which is also consistent for both models.

In Table 1, only for AlexNet does using the object’s ground-truth support perform slightly worse than using the whole image. After we pad the object region with a 10-pixel rim, the top-5 classification error decreases quickly. However, there is a trade-off between context and clutter: providing too much context has diminishing returns. In Fig. 1 we show how the errors vary as a function of the rim size around the object of interest. Performance starts to drop when the rim size exceeds 25%. This padding gives 15.08% and 8.37% top-5 error for AlexNet and VGG16, as opposed to 19.96% and 13.24% respectively, when classifying the whole image.

To ensure that this improvement is not due to downsampling, we repeat the experiment with fixed resolution for the whole image and every subregion. We achieve this by shrinking each region with the same downsampling factor that we apply to the whole image to pass to the CNN. Finally we rescale the downsampled region to the CNN input. These results appear with the label “same resolution” in Fig. 1.

[Figure 1 panels: “AlexNet: Classification error vs. Rim size.” (above) and “VGG-16: Classification error vs. Rim size.” (below)]

Figure 1. The top-1 and top-5 classification errors in ImageNet 2014 as a function of the rim size for AlexNet (above) and VGG16 (below) architecture. A 0 rim size corresponds to the ground-truth bounding box, while 1 refers to the whole image. A relatively small rim around the ground truth provides the best trade-off between informative context and clutter.

Finally, we apply domain size average pooling on the class posterior (i.e., the network’s softmax output layer) with 4 and 8 domain sizes that are concentric with the ground truth. The added rim has the declared size either along both dimensions (for the anisotropic case) or only along the minimum dimension (for the isotropic case), and it is uniformly sampled in the range [0, 30] and [0, 70], respectively. The latter further reduces the top-5 error to 14.22% for AlexNet, which is lower than any single domain size (cf. Fig. 1). This suggests that explicitly marginalizing samples can be beneficial. Next we test whether the improvement stands when using object proposals.
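The domain-size average pooling around the ground truth can be sketched as follows in Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, boxes are `(x0, y0, x1, y1)` in pixels, the rims are taken as evenly spaced values in the stated range, and the isotropic case is read as "apply the rim along the shorter side and scale the longer side to keep the aspect ratio", which is one plausible interpretation of the text.

```python
def padded_box(box, rim, image_size, anisotropic=True):
    """Expand box=(x0, y0, x1, y1) by a rim in pixels, clipped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    if anisotropic:
        # Same pixel rim on every side, so the aspect ratio changes.
        rim_x = rim_y = rim
    else:
        # Assumed reading of the isotropic case: the declared rim applies to the
        # shorter side; the longer side is scaled to preserve the aspect ratio.
        w, h = x1 - x0, y1 - y0
        scale = (min(w, h) + 2.0 * rim) / min(w, h)
        rim_x = w * (scale - 1.0) / 2.0
        rim_y = h * (scale - 1.0) / 2.0
    return (max(0.0, x0 - rim_x), max(0.0, y0 - rim_y),
            min(float(width), x1 + rim_x), min(float(height), y1 + rim_y))


def concentric_boxes(box, num_sizes, max_rim, image_size, anisotropic=True):
    """num_sizes (>= 2) boxes with rims evenly spaced in [0, max_rim]."""
    rims = [max_rim * k / (num_sizes - 1) for k in range(num_sizes)]
    return [padded_box(box, rim, image_size, anisotropic) for rim in rims]


def average_pool(posteriors):
    """Uniform element-wise average of class posteriors (one list per domain size)."""
    return [sum(column) / len(posteriors) for column in zip(*posteriors)]


gt_box = (50, 40, 150, 100)
aniso = concentric_boxes(gt_box, num_sizes=4, max_rim=30, image_size=(256, 256))
iso = concentric_boxes(gt_box, num_sizes=4, max_rim=30, image_size=(256, 256),
                       anisotropic=False)
```

Each box would be cropped, rescaled to the network input and classified; `average_pool` then combines the resulting softmax vectors.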

Introducing object proposals. We deploy a proposal algorithm to generate “object” regions within the image. We use Edge Boxes [48], which provide a good trade-off between recall and speed [24].

First, we decide the number of proposals which will provide a satisfactory cover for the majority of objects present in the dataset. In a single image we search for the highest Intersection over Union (IoU) overlap between the ground-truth region and any proposed sample and in turn we evaluate the network’s performance on the most overlapping sample. We repeat this process for various numbers of proposals on a small subset of the validation set and finally choose N = 80, which provides a satisfactory trade-off between classification performance and computational cost.
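For reference, the overlap search described above can be written in a few lines of Python. The sketch assumes axis-aligned boxes given as `(x0, y0, x1, y1)`; the function names and the example boxes are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_overlapping(ground_truth, proposals):
    """Proposal with the highest IoU against the ground-truth box."""
    return max(proposals, key=lambda p: iou(ground_truth, p))


gt = (10, 10, 110, 110)
proposals = [(0, 0, 50, 50), (20, 20, 120, 120), (100, 100, 200, 200)]
best = best_overlapping(gt, proposals)
```

Repeating this for increasing numbers of proposals and classifying `best` each time gives the kind of performance-versus-N curve from which a value such as N = 80 can be chosen.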

Among the extracted proposals, we choose the most informative subset for our task, based on pruning criteria that we introduce later. Next we discuss what other samples we use, which are also drawn in Fig. 2.

Domain-size pooling and regular crops. We investigate the influence of domain-size pooling at test time both as a stand-alone technique and as additional proposals for the final method, which is described in Algorithm 1. We deploy domain-size aggregation of the network’s class posterior over D sizes that are uniformly sampled in the range [r, 1], where 1 is the normalized size of the original image. After parameter search, we choose D = 5 and r = 0.6. We use both the original and the horizontally flipped area, which gives 10 samples in total.
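A minimal Python sketch of this sampling is given below. It assumes that the "normalized size" scales both side lengths of the image and that the regions are centered on the image center (as in Fig. 2); the function name and the tuple layout are hypothetical.

```python
def concentric_center_crops(width, height, num_sizes=5, min_size=0.6):
    """Centered regions with normalized sizes evenly spaced in [min_size, 1].

    Returns (x0, y0, x1, y1, flipped) tuples: every region appears once as is
    and once flagged for horizontal flipping.
    """
    crops = []
    for k in range(num_sizes):
        size = min_size + (1.0 - min_size) * k / (num_sizes - 1)
        w, h = size * width, size * height
        x0, y0 = (width - w) / 2.0, (height - h) / 2.0
        for flipped in (False, True):
            crops.append((x0, y0, x0 + w, y0 + h, flipped))
    return crops


# D = 5 sizes and r = 0.6, as chosen in the text; the image size is arbitrary.
crops = concentric_center_crops(500, 375, num_sizes=5, min_size=0.6)
```

With 5 sizes and their flips this yields the 10 samples mentioned above; the largest region coincides with the whole image.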

Finally, we use standard data augmentation techniques from the literature. As customary, the image is isotropically rescaled to a predefined size, and then a predetermined selection of crops is extracted [28, 40, 42].

Pruning samples. Continuing to sample patches within the image has diminishing returns in terms of discriminability, while including more background patches with noisy class posterior distributions. We adopt an information-theoretic criterion to filter the samples that we use for the subsequent approximate marginalization.

For each proposal $n \in \{1, \ldots, N\}$ we evaluate the network and take the normalized softmax output $u^n \in \mathbb{R}^C$, where $u^n_i \in [0, 1]$, $i \in \{1, \ldots, C\}$ and $C = 1{,}000$ on ILSVRC classification. The output is a set of non-negative numbers which sum up to 1. We can interpret the vector $u^n$ as a probability distribution on the discrete space of classes $\{1, \ldots, C\}$ and compute the Rényi entropy as $H_\alpha(u^n) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{C} (u^n_i)^\alpha$.




















Figure 2. Visualizing different sampling strategies. Upper left: Object proposals. Generic proposals using Edge Boxes [48]. Upper right: Concentric domain sizes are centered at the center of the image. Below: Regular crops [28, 40, 42].

Our conjecture is that more discriminative class distributions tend to be more peaky with less ambiguity among the classes, and therefore have lower entropy. In Fig. 3 we show how selecting a subset of image patches whose class posterior has lower entropy improves classification performance.

We extract N candidate object proposals³ [48] and evaluate the network for both the original candidates and their horizontal flips. Then we keep a small subset E, whose posterior distribution has the lowest entropy. We use Rényi entropy with relatively small powers ($\alpha = 0.35$), as we found that it encourages selecting regions with more than one highly-confident candidate object. As the parameter $\alpha$ increases, the entropy is increasingly determined by the events of highest probability. Larger $\alpha$ would be more effective for images with a single object, which is not the case in most images in ILSVRC.
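To make the role of the order parameter concrete, the Python sketch below evaluates the standard Rényi entropy of three toy posteriors (one confident class, two confident classes, and a flat distribution) for a small and a large order. The function name and the toy distributions are illustrative assumptions; natural logarithms are used.

```python
import math


def renyi_entropy(p, alpha):
    """Renyi entropy (in nats) of a discrete distribution p, for alpha > 0."""
    if abs(alpha - 1.0) < 1e-12:
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)


single = [0.90, 0.05, 0.03, 0.02]  # one highly confident class
double = [0.45, 0.45, 0.05, 0.05]  # two highly confident classes
flat = [0.25, 0.25, 0.25, 0.25]    # uninformative posterior

for alpha in (0.35, 2.0):
    values = [round(renyi_entropy(p, alpha), 3) for p in (single, double, flat)]
    print(f"alpha={alpha}: single={values[0]} double={values[1]} flat={values[2]}")
```

For any order the flat posterior attains the maximum value log C, so ranking samples by ascending entropy pushes uninformative background patches to the end of the list.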

Finally we introduce a weighted average of the selected posteriors as $\sum_r p(c \mid x|_r)\, p(x|_r)$, where $x|_r$ is the support of sample $r$ and $p(x|_r)$ is the weight of its posterior². We try both uniform weights and weights proportional to the inverse entropy of the posterior $p(c \mid x|_r)$. The latter is expected to perform better, as it naturally gives higher weight to the most discriminative samples.
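The pruning and weighting steps can be put together as in the following Python sketch. It is a sketch under stated assumptions rather than the released implementation: the function names are hypothetical, the weights are normalized to sum to one, and a small constant `eps` is added to the entropy to avoid division by zero for a one-hot posterior.

```python
import math


def renyi_entropy(p, alpha=0.35):
    """Renyi entropy (nats) of a discrete distribution, for alpha != 1."""
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)


def keep_largest(boxes, n):
    """Keep the n boxes (x0, y0, x1, y1) with the largest area."""
    return sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)[:n]


def select_low_entropy(posteriors, e, alpha=0.35):
    """Keep the e posteriors with the lowest Renyi entropy."""
    return sorted(posteriors, key=lambda p: renyi_entropy(p, alpha))[:e]


def marginalize(posteriors, alpha=0.35, weighted=True, eps=1e-8):
    """Average posteriors with uniform or inverse-entropy weights."""
    if weighted:
        weights = [1.0 / (renyi_entropy(p, alpha) + eps) for p in posteriors]
    else:
        weights = [1.0] * len(posteriors)
    total = sum(weights)
    return [sum(w * p[c] for w, p in zip(weights, posteriors)) / total
            for c in range(len(posteriors[0]))]


# Toy posteriors over three classes for three samples (made-up numbers).
samples = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33], [0.10, 0.80, 0.10]]
kept = select_low_entropy(samples, e=2)        # drops the near-flat posterior
uniform = marginalize(kept, weighted=False)
weighted = marginalize(kept, weighted=True)    # favors the peakier posterior
```

In the setting described in the text, `keep_largest` would be applied to 200 Edge Boxes with n = 80 and `select_low_entropy` with e = 12, before the result is averaged together with the concentric and regular crops.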

Comparisons. To compare various sampling and inference strategies, we use the AlexNet and VGG16 models. All classification results in Table 2 refer to the validation set of the ILSVRC 2014 [10], except for the last row, which reports results on the test set. In rows 2-5 we show the performance of popular multi-crop methods [28, 40, 42]. Then we compare them with strategies that involve concentric domain sizes (rows 6-8) and object proposals (rows 9-14).

³ We introduce a prior encouraging the largest proposals among the ones that the standard setting in [48] would give. To this end, instead of directly extracting, for example, N = 80 proposals, we generate 200 and keep the 80 largest ones (Algorithm 1).


[Figure 3 plot title: “Low-entropy class posteriors vs. baselines”]

Figure 3. We show the top-5 error as a function of the number of proposals we average to produce the final posterior. Samples are generated with Algorithm 1 and classified with AlexNet. The blue curve corresponds to selecting samples with the lowest-entropy posteriors. We compare our method with simple strategies such as random selection, ranking by largest-size or highest confidence of proposals. The random sample selection was run 10 times and we visualize the estimated 99.7% confidence intervals as error-bars. Empirically, the discriminative power of the classifier increases when the samples are selected with the least entropy criterion.

Before extracting the crops and in order to preserve the aspect ratio of each single image, we rescale it so that its minimum dimension is 256. The proposals are extracted at the original image resolution and then they are rescaled anisotropically to fit the model’s receptive field. Additionally, some multi-crop algorithms resize the image in S different scales and then sample C patches of fixed size 224 × 224 densely over the image. Szegedy et al. [42] use S = 4 scales and C = 36 crops per scale, which yields 144


Algorithm 1 Regular & adaptive sampling in classification.

• Object proposals. We extract several object proposals from the image x (e.g., 200 Edge Boxes [48]) and keep the N largest ones. Among them we choose the E proposals whose class posterior has the lowest Rényi entropy with parameter $\alpha$. After hyper-parameter search, we choose N = 80, E = 12 and $\alpha = 0.35$.

• D concentric domain sizes around the center of x (including their horizontal flip). We use 5 sizes that are uniformly extracted in the normalized range [0.6, 1], where 1 corresponds to the whole image (D = 10).

• C crops. Regular crops; e.g., C = 10 or C = 50 in 1 or 3 scales, as in [28, 40, 42].

• The class conditionals are approximated as $\sum_r p(c \mid x|_r)\, p(x|_r)$, where $p(x|_r)$ is either uniform or equal to the inverse entropy of the posterior $p(c \mid x|_r)$.

























































Row  Method                                          AlexNet                       VGG16                         #eval  #ave
     # crops     # sizes  # proposals                top-1    top-5    t (s/im)    top-1    top-5    t (s/im)
------------------------------------------------------------------------------------------------------------------------------
1    -           D = 1    -                          43.00    19.96    0.01        33.89    13.24    0.06        1      1
2    C = 10      -        -                          41.50    18.69    0.06        27.55    9.29     0.48        10     10
3    C = 50      -        -                          41.01    18.05    0.66        27.44    9.12     1.34        50     50
4    C = 10 × 3  -        -                          40.58    17.97    0.16        27.23    8.88     1.26        30     30
5    C = 50 × 3  -        -                          40.41    17.55    0.82        27.14    8.85     3.48        150    150
6    -           D = 10   -                          40.00    17.86    0.08        28.16    9.46     0.60        10     10
7    C = 10      D = 10   -                          39.38    17.08    0.22        26.94    8.83     1.08        20     20
8    C = 10 × 3  D = 10   -                          39.36    17.07    0.46        26.76    8.68     1.88        40     40
9    -           -        E = 40                     40.18    17.53    1.26        25.60    8.24     3.02        160    40
10   C = 10      -        E = 20                     38.91    16.63                25.28    7.91                 170    30
11   -           D = 10   E = 12                     38.05    16.19    1.34        25.19    8.11     4.38        170    22
12   C = 10      D = 10   E = 12                     37.69    15.83                25.11    8.01                 180    32
13   C = 10      D = 10   E = 12 (fast)              37.71    15.88    0.94        25.12    8.08     3.70        180    32
14   C = 10      D = 10   E = 12 (W, fast)           37.57    15.82    1.28        25.11    8.02     3.80        180    32
15   C = 10      D = 10   E = 12 (test set)          37.417   16.018   -           25.117   7.909    -           180    32

Blank timing cells belong to entries that cover two rows, as those methods are evaluated together.

Table 2. Top-1 and top-5 errors on the ImageNet 2014 classification challenge. The rows 2-5 include the common data augmentation strategies in the literature [28, 40, 42] (i.e., regular sampling). The next three rows use concentric domain sizes that are uniformly sampled in the range [0.6, 1] with 1 being the normalized size of the original image (cf. Fig. 2). Finally, in the last seven rows, we introduce adaptive sampling, which consists of a data-driven object proposal algorithm [48] and an entropy criterion to select the most discriminative samples on the fly based on the extracted class posterior distribution. The last row shows results on the test set. #eval stands for the number of samples that are evaluated for each method, while #ave is the number of samples that are eventually element-wise averaged to produce one single vector with class confidences. The previous top-reported with regular sampling and our results are shown in bold.


patches in all. Following the methodology from Simonyan et al. [40], it is comparable to deploy S = 3 scales and extract C = 50 crops per scale (5 × 5 regular grid with flips), for a total of 150 crops over 3 scales (row 5 in Table 2).
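The crop count can be verified with a small Python sketch of regular multi-scale sampling. The three smallest-side values in `scales` are assumed for illustration only, since the text does not list them; the helper names are likewise hypothetical.

```python
def rescaled_size(width, height, smallest_side):
    """Image size after isotropic rescaling so that the shorter side equals smallest_side."""
    scale = smallest_side / min(width, height)
    return round(width * scale), round(height * scale)


def grid_crops(width, height, crop=224, grid=5):
    """crop x crop windows on a regular grid x grid layout, each with and without flip."""
    xs = [round(i * (width - crop) / (grid - 1)) for i in range(grid)]
    ys = [round(j * (height - crop) / (grid - 1)) for j in range(grid)]
    return [(x, y, x + crop, y + crop, flipped)
            for y in ys for x in xs for flipped in (False, True)]


scales = (256, 384, 512)  # assumed smallest-side values, not taken from the text
crops = []
for s in scales:
    w, h = rescaled_size(500, 375, s)
    crops.extend((s,) + c for c in grid_crops(w, h))

print(len(crops))  # 150 = 3 scales x 25 grid positions x 2 flips
```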

The results, presented in Table 2, indicate as expected that scale jittering at test time improves the classification performance for both 10-crop and 50-crop strategies. Additionally, the 50-crop strategy is better than the 10-crop strategy for both models. The results on row 5 in bold are the lowest errors that can be achieved with these specific single models⁴ using only regular crops.

Then we present our methods and observe that using the AlexNet network with D = 10 concentric domain sizes outperforms most multi-crop algorithms even if it only evaluates and averages 10 patches. Furthermore, combining it with 10 common crops achieves the best results for both networks, even without using 3-scale jittering. One interpretation for these improvements is that the concentric samples serve as a natural prior for the majority of ILSVRC images, i.e., the object of interest is more likely to lie at the center than at the image boundaries. This is a common assumption in the literature that also appears in large-scale video segmentation [26].

⁴ Specifically, we use the VGG16 model which is trained without scale jittering at training and appears on the first row of the D area in Table 3 in [40]. Pre-trained models for both AlexNet and VGG16 are publicly available with the MatConvNet toolbox [45]. Simonyan et al. in their evaluation with 50 crops and 3 scales report 8.6% top-5 error on ImageNet 2014 validation. In contrast our implementation produces 8.85%, which can be attributed to using a different pre-trained model, as the initial weights are sampled from a zero-mean Gaussian distribution with standard deviation 0.01 and there might also be minor differences in the training process.

Next, we introduce the adaptive sampling mechanism with Algorithm 1 and reduce the top-5 error to 15.83% and 8.01% for AlexNet and VGG16 respectively. To put this in perspective, Krizhevsky et al. [28] report 16.4% top-5 error when they combine 5 models. We improve on this performance with one single model. The relative improvement for the deployed instances of AlexNet and VGG16, compared to the data-augmentation methods used in [40, 42], is 9.9% and 9.4%, respectively. Row 14 shows results where the marginalization is weighted based on the entropy (denoted W), while the methods in rows 9-13 use uniform weights (cf. Algorithm 1). In the last row we show results from the ILSVRC test server for our top-performing method (row 13).

Regular and concentric crops assume that objects occupy most of the image or appear near the center. This is a known bias in the ImageNet dataset. To analyze the effect of adaptive sampling, we calculate the intersection over union error between the objects and the regular and concentric crops, and show in Fig. 4 the performance of various methods as a function of the IoU error. The improvement of using adaptive sampling (via proposals) over only regular and concentric crops increases as the IoU error grows, indicating that objects occupy less of the domain or are far away from the center.

Figure 4. Classification error as a function of the IoU error between the objects and the regular and concentric crops.

Time complexity. In Table 2 we show the number of evaluated samples (#eval) and the subset that is actually averaged (#ave) to extract a single class posterior vector. The sequential time needed for each method is linear in the number of evaluated patches #eval. We run the experiments with the MatConvNet library and parallelize the load for VGG16 so that the testing is done in batches of B = 20 patches. We report the time profile⁵ for each method in Table 2. A few entries cover two boxes, as their methods are evaluated together. Extracting the proposals is not a major bottleneck if using an efficient algorithm [24], such as Edge Boxes [48]. In rows 13-14 we report results of our faster version, where the Edge Boxes do not leverage edge sharpening and use one decision tree. Overall, compared to the 150-crop strategy, the object proposal scheme introduces marginal computational overhead.

2.2. Wide-Baseline Correspondence 

We test the effect of domain-size pooling in correspondence tasks with a convolutional architecture, as done by [12] for SIFT [30], using the datasets and protocols of [16]. This is illustrated in Fig. 2 (upper right), but here the domain sizes are centered around the detector. We expect that such averaging will increase the discriminability of detected regions and in turn the matching ability, similar to the benefits that we see on the last rows of Table 1.

¹We use a machine equipped with an NVIDIA Tesla K80 GPU, 24 Intel
Xeon E5 cores and 64 GB of RAM.

We use maximally-stable extremal regions (MSER) [32]
to detect candidate regions, affine-normalize them, align
them to the dominant orientation, and re-scale them for
head-to-head comparisons. For a detected scale $\sigma$ at each
MSER, the DSP-CNN samples $D$ domain sizes within a
neighborhood $[\lambda_1 \sigma, \lambda_2 \sigma]$ around it, computes the CNN
responses on these samples and averages the posteriors. The
deployed deep network is the unsupervised convolutional
network proposed by [16], which is trained with surrogate
labels from an unlabeled dataset (see the methodology
in [13]), with the objective of being invariant to several
transformations that are commonly observed in images
captured from different viewpoints. As opposed to network
classifiers, here the task is correspondence and the network
is purely a region descriptor, whose last two layers (3 and
4) are the representations.
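
The domain-size pooling step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the detected scale is the side length in pixels of a square window, takes "uniformly sampled" to mean an evenly spaced grid, and uses two hypothetical helpers (`crop_and_resize`, `toy_describe`), the latter standing in for the CNN layer response.

```python
import numpy as np

def domain_sizes(sigma, lam1, lam2, num_sizes):
    """Evenly spaced domain sizes in [lam1 * sigma, lam2 * sigma]."""
    return np.linspace(lam1 * sigma, lam2 * sigma, num_sizes)

def crop_and_resize(image, center, size, out_size=16):
    """Hypothetical helper: sample a square window of side `size` around
    `center` on an out_size x out_size grid (nearest-neighbour lookup)."""
    cy, cx = center
    offsets = (np.arange(out_size) + 0.5) / out_size - 0.5
    ys = np.clip(np.round(cy + offsets * size).astype(int), 0, image.shape[0] - 1)
    xs = np.clip(np.round(cx + offsets * size).astype(int), 0, image.shape[1] - 1)
    return image[np.ix_(ys, xs)]

def toy_describe(patch):
    """Stand-in for a CNN layer response: flattened, L2-normalized patch."""
    v = patch.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def dsp_descriptor(image, center, sigma, describe, lam1=0.7, lam2=1.5, num_sizes=6):
    """Average the descriptor over several domain sizes around one region."""
    sizes = domain_sizes(sigma, lam1, lam2, num_sizes)
    responses = [describe(crop_and_resize(image, center, s)) for s in sizes]
    return np.mean(responses, axis=0)

# Synthetic image, used only to exercise the functions.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
descriptor = dsp_descriptor(image, center=(32, 32), sigma=20.0, describe=toy_describe)
print(descriptor.shape)
```

The pooled descriptor has the same dimensionality as a single-size descriptor, so the extra cost is in evaluation (one forward pass per domain size), not in storage or matching.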

In Fig. 5 (left) we show the comparison between CNN
and DSP-CNN on the Oxford dataset [33]. CNN’s layer 4 is
the representation for each MSER, and DSP-CNN simply
averages this layer’s responses over all $D$ domain sizes. We
use $\lambda_1 = 0.7$, $\lambda_2 = 1.5$ and $D = 6$ sizes that are uniformly
sampled in this neighborhood. There is a 15.1% improvement
in matching mean average precision.

Fischer’s dataset [16] includes 400 pairs of images, some
of them with more extreme transformations than those in
the Oxford dataset. The types of transformations include
zooming, blurring, lighting change, rotation, perspective
and nonlinear transformations. In Fig. 5 (center) and Table
3 we show comparisons between CNN and DSP-CNN for
layer-3 and layer-4 representations and demonstrate 7.7%
and 5.0% relative improvement, respectively. We use
$\lambda_1 = 0.5$, $\lambda_2 = 1.4$ and $D = 10$ domain sizes. These
parameters are selected with cross-validation. In Table 3 we
also show comparisons with baselines, such as using the raw
data and DSP-SIFT [12]. After a fine parameter search
($\lambda_1 = 0.5$, $\lambda_2 = 1.24$) and concatenating layers 3 and 4,
we achieve state-of-the-art performance as shown in Fig. 5
(right), although we note the high dimensionality of this
method compared to local descriptors.
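
The relative improvements quoted above can be checked against the mAP values reported in Table 3. The short sketch below only reproduces that arithmetic; the dictionary and function names are illustrative.

```python
# mAP values copied from Table 3 (Fischer's dataset).
table3_map = {
    "CNN-L3": 48.99,
    "CNN-L4": 50.55,
    "DSP-CNN-L3": 52.76,
    "DSP-CNN-L4": 53.07,
}

def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

for layer in ("L3", "L4"):
    gain = relative_improvement(table3_map[f"CNN-{layer}"],
                                table3_map[f"DSP-CNN-{layer}"])
    print(f"{layer}: {gain:.1f}%")
# prints:
# L3: 7.7%
# L4: 5.0%
```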


Method                           Dim       mAP
Raw patch                        4,761     34.79
SIFT [30]                        128       45.32
DSP-SIFT [12]                    128       53.72
CNN-L3 [16]                      9,216     48.99
CNN-L4 [16]                      8,192     50.55
DSP-CNN-L3                       9,216     52.76
DSP-CNN-L4                       8,192     53.07
DSP-CNN-L3-L4                    17,408    53.74
DSP-CNN-L3 (PCA128)              128       51.45
DSP-CNN-L4 (PCA128)              128       52.33
DSP-CNN-L34 (concat. PCA128)     256       52.69

Table 3. Matching mean average precision for different approaches
on Fischer’s dataset [16].

Figure 5. Head-to-head comparison between CNN and DSP-CNN on the Oxford [33] (left) and Fischer’s [16] (center) datasets. The layer-4
features of the unsupervised network from [16] are used as descriptors. The DSP-CNN outperforms its CNN counterpart in terms of
matching mAP by 15.1% and 5.0%, respectively. Right: DSP-CNN performs comparably to the state-of-the-art DSP-SIFT descriptor [12].


Given the inherent high dimensionality of CNN layers,
we perform dimensionality reduction with principal component
analysis (PCA) to investigate how this affects the matching
performance. In Table 3 we show the performance for
layer-3 and layer-4 representations compressed with PCA to 128
dimensions, and for their concatenation. There is a modest
performance loss, yet the compressed features outperform the
single-scale features by a large margin.
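
The compression step can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: PCA is fitted with a plain SVD on mean-centered descriptors, and the data are synthetic, with reduced stand-in dimensions (512 and 384 instead of 9,216 and 8,192) to keep the example light. The source does not say which data the PCA basis was fitted on.

```python
import numpy as np

def fit_pca(descriptors, num_components):
    """Return the mean and the top principal directions (as rows)."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components]

def project(descriptors, mean, components):
    """Project descriptors onto the retained principal directions."""
    return (descriptors - mean) @ components.T

# Synthetic stand-ins for pooled layer-3 and layer-4 descriptors.
rng = np.random.default_rng(0)
layer3 = rng.standard_normal((300, 512))
layer4 = rng.standard_normal((300, 384))

mean3, comps3 = fit_pca(layer3, 128)
mean4, comps4 = fit_pca(layer4, 128)
compressed3 = project(layer3, mean3, comps3)
compressed4 = project(layer4, mean4, comps4)

# Concatenating the two compressed layers gives a 256-dimensional descriptor,
# matching the dimensionality listed for the concatenated entry in Table 3.
concatenated = np.hstack([compressed3, compressed4])
print(compressed3.shape, compressed4.shape, concatenated.shape)
```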

3. Discussion 

Our empirical analysis indicates that CNNs, which are
designed to be invariant to nuisance variability due to small
planar translations (by virtue of their convolutional
architecture and local spatial pooling) and which learn to manage
global translation, distance (scale) and shape (aspect ratio)
variability by means of large annotated datasets, are in
practice less effective than the naive and in theory
counterproductive practice of sampling and averaging the
conditionals based on an ad-hoc choice of bounding boxes and
their corresponding planar translation, scale and aspect
ratio.

This has to be taken with due caveats. First, we
have shown the statement empirically for a few choices of
network architectures (AlexNet and VGG), trained on
particular datasets that are unlikely to be representative of the
complexity of visual scenes (although they may be
representative of the same scenes as portrayed in the test set),
and with a specific choice of parameters made by their
respective authors, both for the classifier and for the
evaluation protocol. To test the hypothesis in the fairest possible
setting, we have kept all these choices constant while
comparing a CNN trained, in theory, to “marginalize” the
nuisances thus described, with the same CNN applied to bounding
boxes provided by a proposal mechanism. To address the
arbitrary choice of proposals, we have employed those used
in the current state-of-the-art methods, and we have found
the results to be representative of other choices of proposals.

In addition to answering the question posed in the
introduction, along the way we have shown that by framing
the marginalization of nuisance variables as the averaging
of a sub-sampling of marginal distributions we can leverage
concepts from classical sampling theory to anti-alias the
overall classifier, which leads to a performance improvement
both in categorization, as measured on the ImageNet
benchmark, and in correspondence, as measured on the Oxford
and Fischer’s matching benchmarks.

Of course, like any universal approximator, a CNN can in
principle capture the geometry of the discriminant surface
by “learning away” nuisance variability, given sufficient
resources in terms of layers, number of filters, and number
of training samples. So in the abstract sense a CNN can
indeed marginalize out nuisance variability. The analysis
conducted shows that, at the level of complexity imposed by
current architectures and training sets, it does so less
effectively than ad-hoc averaging of proposal distributions.

This leaves researchers the choice of investing more
effort in the design of proposal mechanisms [18, 36],
subtracting duties from the Category CNN downstream, or investing
more effort in scaling up the size and efficiency of learning
algorithms for general CNNs so as to render the need for a
proposal scheme moot.